Table of Contents

Introduction

Facebook friendship visualization by Paul Butler

‘Facebook friendship visualization by Paul Butler’

Often times geospatial data is best understood as exchanges or interactions. Peter Butler’s popular visualization of Facebook friendships provides a powerfully illustrates both the global popularity Facebook and some of the unique societal connections enabled by the social networking giant. Facebook in many ways has removed the need for geospatial proximity to maintain friendships.

Many forms of geospatial data are best expressed as interactions such as: financial transactions, SMS messages, shipping routes, etc. This tutorial will illustrate how to visualize geospatial networks at scale using R and the skopeABI package.

In this tutorial we will visualize 5007 flights available from FlowingData(Yau 2015) which we have made available within the skopeABI package as raw_flight_data. We will organize the tutorial using our standard four phase analytic pipeline doctrine:

  • Collection: We provide code highlighting how we collected our raw data for the tutorial, as well as how to access the data from within the skopeABI package.
  • Explore, Transform, Load (ETL): Within this phase we will provide a basic summary of the data, wask through how to generate a graph from our raw data. We provide the flight data in graph form within the skopeABI package as well. We then tranform the graph edgelist into proper format for plotting arcs, and then generate arcs and convert our edgelist into long format for plotting.
  • Model: In this phase we use network analysis to generate insight into airports and traffic flow within our data.
  • Communicate: In this phase we walk through how to generate maps useful for at scale geospatial network visualization.

Collection

For this tutorial we will use flight data available from FlowingData which we have made available within the skopeABI package as raw_flight_data.

airports <- read.csv("http://datasets.flowingdata.com/tuts/maparcs/airports.csv", 
                     stringsAsFactors=FALSE,header=TRUE)
flights <- read.csv("http://datasets.flowingdata.com/tuts/maparcs/flights.csv", 
                    stringsAsFactors=FALSE,header=TRUE, as.is=TRUE)

Explore, Transform, Load (ETL)

Exploration

The raw data consists of two data frames. airports consists of 7 attributes associated with 3379 airports within the united states:

airports=raw_flight_data$airports
str(airports)
## 'data.frame':    3379 obs. of  7 variables:
##  $ iata   : chr  "00M" "00R" "00V" "01G" ...
##  $ airport: chr  "Thigpen " "Livingston Municipal" "Meadow Lake" "Perry-Warsaw" ...
##  $ city   : chr  "Bay Springs" "Livingston" "Colorado Springs" "Perry" ...
##  $ state  : chr  "MS" "TX" "CO" "NY" ...
##  $ country: chr  "USA" "USA" "USA" "USA" ...
##  $ lat    : num  32 30.7 38.9 42.7 30.7 ...
##  $ long   : num  -89.2 -95 -104.6 -78.1 -81.9 ...

The second item in the raw_flight_data list is a flights data frame consisting of 5007 flights expressed using 4 attributes:

flights=raw_flight_data$flights
str(flights)
## 'data.frame':    5007 obs. of  4 variables:
##  $ airline : chr  "AA" "AA" "AA" "AA" ...
##  $ airport1: chr  "DFW" "MSP" "LGA" "TPA" ...
##  $ airport2: chr  "SJU" "DFW" "ORD" "JFK" ...
##  $ cnt     : int  120 326 860 56 44 550 496 200 214 50 ...

Transformation

Data cleaning and transformation is often needed to make our data most useful. Although the data cleaning for this tutorial is already complete, we do need to make several transformations. We first need to generate a graph object using the igraph package. This will enable us to conduct network analysis of our flight data. We then have to convert the flights edgelist to contain the corresponding lat/lon for each airport expressed in a given edge. Finally we need to generate arc data in long format for visualization.

Data Cleaning

Several flights associated with airports in East Asia cross the anti-meridian causing problems with out map boundaries.

##                           airport       lat     long
## 2797                  Prachinburi 14.078333 101.3783
## 2798             Babelthoup/Koror  7.367222 134.5442
## 3004 Tinian International Airport 14.996111 145.6214
## 3359            Yap International  9.516700 138.1000
##                             country
## 2797                       Thailand
## 2798                          Palau
## 3004              N Mariana Islands
## 3359 Federated States of Micronesia

For the purpose of this tutorial we will simply remove these flights; however, we will address geospatial networks that cross the anti-meridian in a subsequent tutorial.

# remove airports that are not inside the United States
airports=airports[airports$country == 'USA',]
str(airports)
## 'data.frame':    3375 obs. of  7 variables:
##  $ iata   : chr  "00M" "00R" "00V" "01G" ...
##  $ airport: chr  "Thigpen " "Livingston Municipal" "Meadow Lake" "Perry-Warsaw" ...
##  $ city   : chr  "Bay Springs" "Livingston" "Colorado Springs" "Perry" ...
##  $ state  : chr  "MS" "TX" "CO" "NY" ...
##  $ country: chr  "USA" "USA" "USA" "USA" ...
##  $ lat    : num  32 30.7 38.9 42.7 30.7 ...
##  $ long   : num  -89.2 -95 -104.6 -78.1 -81.9 ...
# remove flights associated with non-US airports
flights=flights[flights$airport1 %in% airports$iata & flights$airport2 %in% airports$iata,]
str(flights)
## 'data.frame':    5005 obs. of  4 variables:
##  $ airline : chr  "AA" "AA" "AA" "AA" ...
##  $ airport1: chr  "DFW" "MSP" "LGA" "TPA" ...
##  $ airport2: chr  "SJU" "DFW" "ORD" "JFK" ...
##  $ cnt     : int  120 326 860 56 44 550 496 200 214 50 ...

Generate Geospatial Graph G

Our first step is to generate an empty graph and then assign our 7 attributes to the vertices of the graph:

G=graph.empty(n=length(airports$iata),directed=TRUE)
  V(G)$name=airports$iata
  V(G)$airport=airports$airport
  V(G)$lat=airports$lat
  V(G)$lon=airports$lon
  V(G)$city=airports$city
  V(G)$state=airports$state
  V(G)$country=airports$country
vcount(G)
## [1] 3375

We then add flights as edges along with 2 additional attributes cnt and airline. We label cnt as weight to facilitate weighted graph functionality within the igraph package.

G=add_edges(G,edges=rbind(flights$airport1,flights$airport2))
E(G)$weight=flights$cnt
E(G)$airline=flights$airline

# delete vertices with no flight data
G=delete_vertices(G,v=degree(G,mode='total')==0)
vcount(G)
## [1] 282
ecount(G)
## [1] 5005

Similar to the raw_flight_data object. We make the graph above available within skopeABI as flight_graph

Generate Edges with Geocoordinates

Currently our graph, G, has its edges expressed by iata.

head(get.edgelist(G,names=TRUE))
##      [,1]  [,2] 
## [1,] "DFW" "SJU"
## [2,] "MSP" "DFW"
## [3,] "LGA" "ORD"
## [4,] "TPA" "JFK"
## [5,] "STT" "BOS"
## [6,] "PHX" "DFW"

However, to generate arcs for our visualization, we will need the lat/lon for the origin and destination airports for each flight. For this we provide the buildEdgeCoordDF function:

  • Function: buildEdgeCoordDF(G,attributes)
  • Purpose: given a graph with geocoordinates as node attributes, generate an edgelist with source lat/lon and target lat/lon as well as attributes of of the user’s choosing.
  • Input: a graph with node attributes of lat and lon; a list of edge attibutes to be included at column features in the functions output.
  • Output: a data frame with the graphs original edgelist as well as coordinates for the endpoints of each edge.
arc_point_df=buildEdgeCoordDF(G,attributes=list(airline=E(G)$airline,weight=E(G)$weight))
head(arc_point_df)
##   source target airline weight sourceLat  sourceLon targetLat targetLon
## 1    DFW    SJU      AA    120  32.89595  -97.03720  18.43942 -66.00183
## 2    MSP    DFW      AA    326  44.88055  -93.21692  32.89595 -97.03720
## 3    LGA    ORD      AA    860  40.77724  -73.87261  41.97960 -87.90446
## 4    TPA    JFK      AA     56  27.97547  -82.53325  40.63975 -73.77893
## 5    STT    BOS      AA     44  18.33731  -64.97336  42.36435 -71.00518
## 6    PHX    DFW      AA    550  33.43417 -112.00806  32.89595 -97.03720

Generating Arcs in Long Format

To express these flights as arcs on a map requires us to enrich and transform the endpoint depicted in our flight_data data frame. Although several tutorials provide examples of plotting relatively small transactional data sets in R, they often use loops to generate each are as an individual layer which will cause problems with batch limits at scale (for example see this RBloggers Tutorial (“Create Air Travel Route Maps in Ggplot: A Visual Travel Diary. R-Bloggers” 2017)). This limitation is easily overcome by converting our data structure from wide to long format (see the RCookbook (“Converting Data Between Wide and Long Format” 2019)). We provide the arcMelt function to handle arc generation and visualization at scale.

  • Function: arcMelt(arc_point_df,n,cores,weighted)
  • Purpose: this function builds a melted data frame of indexed arc points. The function enables plotting of flights, transactions, or other exchanges at scale.
  • Input: a data frame where a row represents endpoints of an arc with index and count (optional) see flight_data for an example.
  • Output: a melted data frame where each row is a point corresponding to a specific arc.
# Build melted arc data
arc_data=arcMelt(arc_point_df,n=5,cores=1,weighted=TRUE)
head(arc_data)
##          lon      lat index weight
## 1: -97.03720 32.89595     1    120
## 2: -91.30580 30.99844     1    120
## 3: -85.81177 28.85820     1    120
## 4: -80.54996 26.50462     1    120
## 5: -75.50737 23.96607     1    120
## 6: -70.66540 21.26935     1    120

Load

In addition to the arc_data data.frame, we need to load the necessary shape files to construct our map layer. We provide wrapper around the package rnaturalearth which provides a map of countries of the entire world. Use ne_countries to pull country data and choose the scale (rnaturalearthhires is necessary for scale = “large”). The function can return sp classes (default) or directly sf classes, as defined in the argument returnclass. The code below provides a based worldmap.

#construct map w/rnaturalearth and sf
world <- ne_countries(scale = "medium", returnclass = "sf")

m=ggplot(data = world) +
  geom_sf(colour=gray(.5),size=.05,fill=gray(.2))

m
TRUE

TRUE

Although we now have basic map data, we must adjust several parameters to develop meaningful visualizations of graphs with large numbers of edges.

Modeling: Network Analysis

The concept of centrality within networks attempts to identify the most important vertices or nodes within a network. By modeling our flight data as a graph, this enables us to identify the most important airports within the US based on graph metrics that account for more than simple counts.

We develop three additional features/columns in our airports data frame base on graph centrality measures and construct an interactive plot below.

  • degree centrality - degree centrality simply sums the inbound and output edge weights at a given node. In our case this is the sum of all of the edges corresponding to inbound and outbound flights. The size of the points in the plot below corresponds to degree.
  • eigenvector centrality - eigen vector centrality attempts to measure how well connected a given node is. The assumption is that each node’s centrality is the sum of the centrality values of the nodes that it is connected to. The x-axis of the plot below corresponds to eigenvector centrality.
  • betweenness centrality - In graph theory, betweenness centrality is a measure of centrality in a graph based on shortest paths. … The betweenness centrality for each vertex is the number of these shortest paths that pass through the vertex. The y-axis of the plot below corresponds to betweenness centrality.

Hover over each data point to identify the corresponding airport.

# remove airports that do not have flight data

airports=airports[ airports$iata %in% V(G)$name, ]

# add node metrics to airports data.frame
airports$ev_centrality=eigen_centrality(G,directed=TRUE,scale=TRUE)$vector
airports$degree=degree(G,v=V(G),mode='total')
airports$betweenness=betweenness(G,v=V(G),directed=TRUE,weights=E(G)$weight,normalized=TRUE)

# construct interactive plot
f <- list(
  family = "Courier New, monospace",
  size = 18,
  color = "#7f7f7f")

x <- list(
  title = "eigenvector centrality",
  titlefont = f)

y <- list(
  title = "betweenness",
  titlefont = f)
  
p <- plot_ly(type = 'scatter', mode = 'markers') %>%
  add_trace(
    x = airports$ev_centrality, 
    y = airports$betweenness,
    size=airports$degree/max(airports$degree)+.25,
    text = airports$airport,
    hoverinfo = 'text',
    marker = list(color='#a8262b',
    line = list(
        color = 'black',
        width = 1)),
    showlegend = F) %>%
  layout(xaxis = x, yaxis = y)

p

TRUE

It is also interesting to look at nodes with anomalous characteristics. In the visualization below we use mahanolobis distance based on node betweenness and eigenvector centrality to idenfity nodes that have unusual characteristics. The y-axis in the graphic below corresponds to mahanolobis distance of our airports eigenvector and betweenness centrality measures. If we think of our airport’s features as a point p in 2D space where x corresponds to eigenvector centrality, and y corresponds to betweenness centrality. Mahalanobis distance measures the distance between an airport’s point p and the center of our entire data set’s distribution. Introduced by P. C. Mahalanobis in 1936, it is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. You can see a clear parabolic trend in the visualization. Points above this trendline are worthy of manual inspection.

TRUE

Communicate: GeoNet Visualization

Now we look to visualize our network geospatially. To do so we must adjust the appearance of our map layer to enable effective visualization and ensure our arcs are scaled appropriately for effective visualization.

Set a Dark Map Layer and Set boundaries

Visualizing geospatial networks at scale requires strong contrast between the map layer and the network layer of the visualization. For this purpose we provide a setGeoNetworkTheme function that provides a dark map with subtle boundaries.

Large scale in terms of the number of arcs to be plotted requires auto bounding of the map layer. For this purpose we provide a getMapBoundaries function to easily enable the analyst to set a buffer margin and use the data to define the plot region.

#construct map w/rnaturalearth and sf
setGeoNetworkTheme()
map_boundaries=getMapBoundaries(arc_data,buffer=0.05)

m=ggplot(data = world) +
  
  # define map boundaries
  xlim(map_boundaries$lon[1],map_boundaries$lon[2])+
  ylim(map_boundaries$lat[1],map_boundaries$lat[2])+
  
  # plot map layer
  geom_sf(colour=gray(.5),size=.05,fill=gray(.2))+
  
  # add arcs
  geom_line(aes(x=lon,y=lat,group=index),
            size=1,
            color='green1',
            data=arc_data)+
  
  
  theme(legend.position = 'none')
TRUE

TRUE

Finally we must adjust the width and transparancy of each arc to highlight structure within the plot. To do so we adjust parameters within the geom_line portion below by adding the ‘alpha’ parameter for transparency as a function of the count per arc (i.e. passenger volume). We also adjust the size parameter of the geom_line function. Finally, we use the geom_point function to plot points for all of the airports within our data and adjust for scale based on Mahalanobis Distance.

m=ggplot(data = world) +
  xlim(map_boundaries$lon[1],map_boundaries$lon[2])+
  ylim(map_boundaries$lat[1],map_boundaries$lat[2])+
  geom_sf(colour=gray(.5),size=.05,fill=gray(.2))+
    geom_line(aes(x=lon,y=lat,group=index,alpha=weight),
            size=.05,
            color='green1',
            data=arc_data)+
  geom_point(aes(x=long,y=lat,size=mahalanobis),color='cyan',
             data=airports)+
  scale_size(range=c(.005,1))+
  theme(legend.position = 'none')
TRUE

TRUE

Conclusion

In this tutorial we have walked through how to generate geospatial network visualizations in R using a few simpe wrappers and data provided in the skopeABI package. We hope you find this tutorial useful, and welcome feedback and contributions to our open development project.

Works Cited

“Converting Data Between Wide and Long Format.” 2019. Accessed December 31. http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/.

“Create Air Travel Route Maps in Ggplot: A Visual Travel Diary. R-Bloggers.” 2017. March 6. https://www.r-bloggers.com/create-air-travel-route-maps-in-ggplot-a-visual-travel-diary/.

Yau, Nathan. 2015. “Thanksgiving Flight Patterns. FlowingData.” November 25. https://flowingdata.com/2015/11/25/thanksgiving-flight-patterns/.